About upper bounds on the complexity of Policy Iteration∗
Authors
Abstract
We consider Acyclic Unique Sink Orientations of the n-dimensional hyper-cube (AUSOs), that is, acyclic orientations of the edges of the hyper-cube such that every sub-cube has a unique vertex of maximal in-degree. We study the Policy Iteration (PI) algorithm, also known as Bottom-Antipodal or Switch-All, to find the global sink: starting from an initial vertex $\pi_0$ (with $i = 0$), the outgoing links at the current vertex $\pi_i$ define a sub-cube of the AUSO. Policy Iteration jumps to the vertex $\pi_{i+1}$ that is antipodal to $\pi_i$ in this sub-cube. This procedure is repeated until the sink is found. Finite-time convergence is guaranteed.
Policy Iteration was shown by Mansour & Singh [MS99] to require at most $6 \cdot \frac{2^n}{n}$ steps to converge. Our goal in this work is to improve this bound. Our first contribution is to provide the first improvement over Mansour & Singh's bound in fifteen years, namely a $(2 + o(1)) \frac{2^n}{n}$ upper bound (Section 2). Perhaps more importantly, we also show in Section 3 that this bound is optimal for an important relaxation of the problem.
Policy Iteration was originally designed to solve Markov Decision Processes (MDPs), for which our new bound also holds. The algorithm and the bound also apply to 2-Player Turn-Based Stochastic Games (2TBSGs), a two-player generalization of MDPs for which no polynomial-time algorithms are known. In MDPs and 2TBSGs, the vertices of an AUSO correspond to policies or strategies, and the goal is to find the policy that optimizes some objective function and that corresponds to the global sink. The partial order that shapes the policies of an MDP or a 2TBSG is an AUSO with some additional properties. The best known lower bounds for the number of steps of PI are $\Omega(\sqrt[7]{2}^{\,n})$ for MDPs [Fea10], $\Omega(\sqrt[9]{2}^{\,n})$ for Parity Games, a special case of 2TBSGs [Fri09], and $\Omega(\sqrt{2}^{\,n})$ for AUSOs [SS05].
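The Policy Iteration jump described above (flip every outgoing coordinate at once, which lands on the antipodal vertex of the outgoing sub-cube) can be sketched in a few lines. This is a minimal illustration, not the authors' code: vertices are encoded as integer bitmasks, and `parity_outmap` is a hypothetical example orientation chosen here only because it is easy to check by hand that it is acyclic with its unique sink at vertex 0.

```python
from typing import Callable, List


def parity_outmap(v: int, n: int) -> int:
    """Hypothetical example orientation (an assumption, not from the paper).

    Coordinate i is outgoing at vertex v iff the XOR of bits i..n-1 of v is 1.
    Returns the outgoing coordinates as a bitmask.
    """
    out = 0
    parity = 0
    for i in reversed(range(n)):
        parity ^= (v >> i) & 1
        out |= parity << i
    return out


def policy_iteration(start: int, outmap: Callable[[int], int],
                     max_steps: int = 1 << 20) -> List[int]:
    """Bottom-Antipodal / Switch-All: repeatedly flip all outgoing coordinates.

    Returns the list of visited vertices, ending at the sink.
    Termination relies on the orientation being an AUSO; max_steps is only
    a safety guard for inputs that are not.
    """
    path = [start]
    v = start
    for _ in range(max_steps):
        out = outmap(v)
        if out == 0:          # no outgoing edge: v is the global sink
            return path
        v ^= out              # jump to the antipodal vertex of the sub-cube
        path.append(v)
    raise RuntimeError("no sink reached; is the orientation an AUSO?")


if __name__ == "__main__":
    n = 3
    path = policy_iteration(0b111, lambda v: parity_outmap(v, n))
    print([format(v, f"0{n}b") for v in path])
```

On this example orientation with n = 3, starting from vertex 111, the run visits 111, 010, 001, 000 and so takes three steps.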
Establishing these lower bounds was a major milestone for the study of the Simplex algorithm for Linear Programming, as they lead to exponential lower bounds for some critical pivoting rules [Fri11, FHZ11]. More details are given in Section 5.
Our second contribution concerns an alternate relaxation of the upper bound problem proposed by Hansen & Zwick [Han12] that can essentially be formulated as follows. Let $A \in \{0, 1\}^{m \times n}$ be a binary matrix such that for every pair of rows $i, j$ of $A$ with $1 \le i < j \le m$, there exists a column $k$ such that $A_{i,k} \neq A_{i+1,k} = A_{j,k} = A_{j+1,k}$. The maximum number of rows of such a matrix $A$ for a given number of columns $n$ is also a bound for the number of steps of PI for an n-dimensional cube. Using exhaustive search for the number of columns of $A$ ranging from 1 to 7, Hansen & Zwick obtained 2, 3, 5, 8, 13, 21, ≥33 as its maximum number of rows. Given these observations, they conjectured that the maximum number of steps of PI should be given by $F_{n+2}$, the $(n+2)$nd Fibonacci number. From the numerical results, their conjecture appears to be a perfect match, but confirming it for $n = 7$ was described as an interesting yet computationally hard challenge. It was proposed as January 2014's IBM Ponder This challenge. As our second contribution, we show that Hansen & Zwick's conjecture is false. In Section 4 we explain how we solved the IBM challenge by searching through every 7-column candidate matrix, finally finding nothing better than 33 rows, one short of $F_9 = 34$.
∗ A preliminary version of this work was presented at the 25th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA'2014).
† This work was supported by an ARC grant from the French Community of Belgium and by the IAP network 'Dysco' funded by the office of the Prime Minister of Belgium. The scientific responsibility rests with the authors. J.-C. D. is with CORE and NAXYS. R. M. J. is an F.R.S./FNRS Research Associate.
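The matrix condition of Hansen & Zwick quoted in the abstract can be checked mechanically, and the reported maxima can be compared against the Fibonacci numbers. The sketch below is illustrative only. The function names are hypothetical, and it makes one explicit assumption: the abstract does not say how to read the condition when $j = m$ (row $j+1$ does not exist), so the code treats the missing row as a copy of the last row. Under that reading, a one-column matrix can have at most 2 rows, matching the first reported value.

```python
from itertools import combinations
from typing import List, Sequence


def satisfies_condition(rows: Sequence[Sequence[int]]) -> bool:
    """Check: for all rows i < j there is a column k with
    A[i][k] != A[i+1][k] == A[j][k] == A[j+1][k].

    ASSUMPTION (not stated in the source): a row index past the end is
    clamped to the last row, i.e. row m+1 is read as a copy of row m.
    """
    m = len(rows)
    if m == 0:
        return True
    n = len(rows[0])

    def row(i: int) -> Sequence[int]:
        return rows[min(i, m - 1)]   # clamp: the assumed convention

    for i, j in combinations(range(m), 2):   # 0-based pairs with i < j
        if not any(row(i)[k] != row(i + 1)[k] == row(j)[k] == row(j + 1)[k]
                   for k in range(n)):
            return False
    return True


def fibonacci(k: int) -> int:
    """k-th Fibonacci number with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def conjectured_maxima(max_n: int) -> List[int]:
    """Values F_{n+2} for n = 1..max_n, as in the conjecture."""
    return [fibonacci(n + 2) for n in range(1, max_n + 1)]


if __name__ == "__main__":
    # Maxima quoted in the abstract for n = 1..7 (33 is the value for n = 7).
    reported = [2, 3, 5, 8, 13, 21, 33]
    for n, (got, conj) in enumerate(zip(reported, conjectured_maxima(7)), 1):
        print(n, got, conj, "match" if got == conj else "MISMATCH")
    print(satisfies_condition([[0, 0], [1, 1], [1, 0]]))
```

The comparison shows agreement for n = 1 to 6 and a mismatch at n = 7 (33 versus 34), which is the gap the abstract refers to.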
Similar resources
On the Complexity of Policy Iteration
Decision-making problems in uncertain or stochastic domains are often formulated as Markov decision processes (MDPs). Policy iteration (PI) is a popular algorithm for searching over policy-space, the size of which is exponential in the number of states. We are interested in bounds on the complexity of PI that do not depend on the value of the discount factor. In this paper we prove the first...
Improved Strong Worst-case Upper Bounds for MDP Planning
The Markov Decision Problem (MDP) plays a central role in AI as an abstraction of sequential decision making. We contribute to the theoretical analysis of MDP planning, which is the problem of computing an optimal policy for a given MDP. Specifically, we furnish improved strong worst-case upper bounds on the running time of MDP planning. Strong bounds are those that depend only on the number of ...
Exponential Lower Bounds for Policy Iteration
We study policy iteration for infinite-horizon Markov decision processes. It has recently been shown that policy-iteration-style algorithms have exponential lower bounds in a two-player game setting. We extend these lower bounds to Markov decision processes with the total-reward and average-reward optimality criteria.
Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the γ-discounted optimal policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the sta...
A Comparison of Iterated Optimal Stopping and Local Policy
A theoretical analysis tool, iterated optimal stopping, has been used as the basis of a numerical algorithm for American options under regime switching [25]. Similar methods have also been proposed for American options under jump diffusion [4] and Asian options under jump diffusion [5]. An alternative method, local policy iteration, has been suggested in [27, 19]. Worst case upper bound...
Estimating Upper and Lower Bounds For Industry Efficiency With Unknown Technology
With a brief review of the studies on the industry in Data Envelopment Analysis (DEA) framework, the present paper proposes inner and outer technologies when only some basic information is available about the technology. Furthermore, applying Linear Programming techniques, it also determines lower and upper bounds for directional distance function (DDF) measure, overall and allocative efficienc...
Publication date: 2014